Conversation
Ran an inference benchmark on an A100 GPU to compare multi-query (MQ) with multi-head (MH) attention. I used:

Some findings (times are in ms):
- Some timers slow down the inference, significantly more for the MH model than for the MQ model.
- With only a timer on the whole model, the difference is about 3 seconds in favour of MQ; for some reason it jumps to 6 seconds when using timers within each layer. The timers use
- We end up with a 26% reduction on the transformer-forward step when comparing 13884.54 (MH) against 10263.32 (MQ).
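The post doesn't show the timer implementation, so here is a minimal sketch of accumulating wall-clock timers (names like `timer` and `timings` are made up for illustration). It shows the whole-model vs. per-layer setup described above; on GPU, each timer would also need to synchronize (e.g. `torch.cuda.synchronize()`) before reading the clock, and that synchronization is itself overhead, which is one plausible reason per-layer timers slow inference more than a single outer timer.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock times in ms, keyed by timer name (hypothetical structure).
timings = {}

@contextmanager
def timer(name):
    # On GPU, accurate timing requires a device synchronization here
    # (e.g. torch.cuda.synchronize()); doing that inside every layer
    # adds more overhead than doing it once around the whole model.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1e3
        timings[name] = timings.get(name, 0.0) + elapsed_ms

# Usage: one timer around the whole model, plus one per layer.
with timer("model"):
    for i in range(4):
        with timer(f"layer_{i}"):
            sum(x * x for x in range(10_000))  # stand-in for a layer's forward pass

per_layer_total = sum(v for k, v in timings.items() if k.startswith("layer_"))
print(f"model: {timings['model']:.2f} ms, layers summed: {per_layer_total:.2f} ms")
```

Because the per-layer timers are nested inside the whole-model timer, their sum is bounded by the outer measurement; any gap between the two is the non-layer work plus timer overhead itself.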